SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
Transfer learning has fundamentally changed the landscape of natural language
processing (NLP) research. Many existing state-of-the-art models are first
pre-trained on a large text corpus and then fine-tuned on downstream tasks.
However, due to limited data resources from downstream tasks and the extremely
large capacity of pre-trained models, aggressive fine-tuning often causes the
adapted model to overfit the data of downstream tasks and forget the knowledge
of the pre-trained model. To address the above issue in a more principled
manner, we propose a new computational framework for robust and efficient
fine-tuning for pre-trained language models. Specifically, our proposed
framework contains two important ingredients: 1. Smoothness-inducing
regularization, which effectively manages the capacity of the model; 2. Bregman
proximal point optimization, which is a class of trust-region methods and can
prevent knowledge forgetting. Our experiments demonstrate that our proposed
method achieves state-of-the-art performance on multiple NLP benchmarks.
Comment: The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).
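The two ingredients lend themselves to a short illustration. The sketch below shows a SMART-style smoothness-inducing (adversarial) regularizer in PyTorch; it assumes a HuggingFace-style classifier whose forward pass accepts inputs_embeds and returns an object with a .logits field, and the names smoothness_loss, eps, lambda_s, and steps are illustrative rather than taken from the paper's released code.

import torch
import torch.nn.functional as F

def smoothness_loss(model, inputs_embeds, attention_mask, eps=1e-3, steps=1):
    """Symmetric KL between predictions on clean and perturbed embeddings."""
    with torch.no_grad():
        clean_logits = model(inputs_embeds=inputs_embeds,
                             attention_mask=attention_mask).logits

    # Start from a small random perturbation and refine it by gradient ascent
    # on the KL term, projecting back onto an L_inf ball of radius eps.
    delta = torch.zeros_like(inputs_embeds).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_logits = model(inputs_embeds=inputs_embeds + delta,
                           attention_mask=attention_mask).logits
        kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                      F.softmax(clean_logits, dim=-1), reduction="batchmean")
        grad, = torch.autograd.grad(kl, delta)
        delta = (delta + eps * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Penalize disagreement between clean and perturbed predictions, which
    # encourages the model to be smooth in a neighborhood of each input.
    adv_logits = model(inputs_embeds=inputs_embeds + delta,
                       attention_mask=attention_mask).logits
    return (F.kl_div(F.log_softmax(adv_logits, dim=-1),
                     F.softmax(clean_logits, dim=-1), reduction="batchmean") +
            F.kl_div(F.log_softmax(clean_logits, dim=-1),
                     F.softmax(adv_logits, dim=-1), reduction="batchmean"))

In a training step this term would be added to the task loss as loss = task_loss + lambda_s * smoothness_loss(...). The Bregman proximal point ingredient can likewise be approximated by penalizing divergence from a slowly updated (e.g., exponential-moving-average) copy of the model, which is what discourages drifting too far from the pre-trained knowledge.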
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
This paper presents a new pre-trained language model, DeBERTaV3, which
improves the original DeBERTa model by replacing masked language modeling (MLM)
with replaced token detection (RTD), a more sample-efficient pre-training task.
Our analysis shows that vanilla embedding sharing in ELECTRA hurts training
efficiency and model performance. This is because the training losses of the
discriminator and the generator pull token embeddings in different directions,
creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled
embedding sharing method that avoids the tug-of-war dynamics, improving both
training efficiency and the quality of the pre-trained model. We have
pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its
exceptional performance on a wide range of downstream natural language
understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an
example, the DeBERTaV3 Large model achieves a 91.37% average score, which is
1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art
(SOTA) among the models with a similar structure. Furthermore, we have
pre-trained a multi-lingual model mDeBERTa and observed a larger improvement
over strong baselines compared to English models. For example, the mDeBERTa
Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6%
improvement over XLM-R Base, creating a new SOTA on this benchmark. We have
made our pre-trained models and inference code publicly available at
https://github.com/microsoft/DeBERTa.
Comment: 16 pages, 10 tables, 2 figures. The DeBERTaV3 model significantly
improves performance on downstream NLU tasks over models with a similar
structure, e.g., DeBERTaV3 Large achieves a 91.37% average GLUE score, which is
1.37% over DeBERTa Large. XSmall has only 22M backbone parameters, but
significantly outperforms RoBERTa/XLNet-base.
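The gradient-disentangled sharing can be pictured with a minimal PyTorch sketch. The module below is hypothetical (class and parameter names are illustrative, not the released DeBERTa code): the discriminator reuses the generator's token embedding table through a stop-gradient, and its replaced-token-detection loss updates only a residual table initialized to zero, so the two losses no longer pull the shared embeddings in opposite directions.

import torch
import torch.nn as nn

class GDESEmbeddings(nn.Module):
    """Token embeddings shared between generator and discriminator, with the
    discriminator's gradient routed into a residual table only."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # Shared table, updated only by the generator's MLM loss.
        self.generator_embeddings = nn.Embedding(vocab_size, hidden_size)
        # Residual table, updated only by the discriminator's RTD loss;
        # initialized to zero so training starts from pure sharing.
        self.delta_embeddings = nn.Embedding(vocab_size, hidden_size)
        nn.init.zeros_(self.delta_embeddings.weight)

    def generator_embed(self, input_ids):
        return self.generator_embeddings(input_ids)

    def discriminator_embed(self, input_ids):
        # detach() blocks the discriminator's gradient from reaching the
        # shared generator table (the "gradient disentangling").
        shared = self.generator_embeddings(input_ids).detach()
        return shared + self.delta_embeddings(input_ids)

After pre-training, the discriminator's effective embedding table is the sum of generator_embeddings.weight and delta_embeddings.weight, so it still benefits from what the generator learned while its own loss never drags the shared table toward the RTD objective.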